System and method for early detection of impending failure of a data storage system

ABSTRACT

Various embodiments are disclosed of an early failure detection system for a data storage system that can experience difficulties, such as system failure or loss of data integrity, when it runs out of spare storage locations. Spare storage locations can be used by a data storage system to replace storage locations that have become defective. In one embodiment, a count is kept of the available spare storage locations in a system, or sub-system, and when the amount of available spare locations drops to a threshold value, an action can be taken to avoid the consequences of an impending system failure. In other embodiments, the available spare storage locations are monitored by keeping track of the percentage of initially available spare locations still remaining, by keeping track of the rate of new spare locations being used, or by other techniques. In various embodiments, the early failure detection system may be implemented using procedures, data structures, and hardware that reside in and that may be executed in various locations, or parts, of the data storage system. In various embodiments, the early failure detection system responds to detection of a possible impending failure by taking one or more of a variety of actions, including, for example, sending an alert notification, enabling additional storage capacity, copying portions of the data stored in the system to other secure storage locations, shutting the system down, and taking no action.

CLAIM FOR PRIORITY

[0001] This application claims the benefit of priority under 35 U.S.C. §119(e) of U.S. Provisional Application No. 60/257,760 filed on Dec. 22, 2000 and entitled “SYSTEM AND METHOD FOR EARLY DETECTION OF IMPENDING FAILURE OF SOLID-STATE STORAGE SYSTEMS” and U.S. Provisional Application No. 60/257,648 filed on Dec. 22, 2000 and entitled “SYSTEM AND METHOD FOR INTER-CHIP BLOCK REPLACEMENT”; the entirety of both of these applications are incorporated herein by reference.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] The present invention relates to systems and methods for managing defects in a digital data storage system. More particularly, the invention relates to systems and methods for early failure detection on memory devices such as Flash EEPROM devices.

[0004] 2. Description of the Related Art

[0005] Computer systems typically include magnetic disk drives for the mass storage of data. Although magnetic disk drives are relatively inexpensive, they are bulky and contain high-precision mechanical parts. As a consequence, magnetic disk drives are prone to reliability problems, and as such are treated with a high level of care. In addition, magnetic disk drives consume significant quantities of power. These disadvantages limit the size and portability of computer systems that use magnetic disks, as well as their overall durability.

[0006] As demand has grown for computer devices that provide large amounts of storage capacity along with durability, reliability, and easy portability, attention has turned to solid-state memory as an alternative or supplement to magnetic disk drives. Solid-state storage devices, such as those employing Dynamic Random Access Memory (DRAM) and Static Random Access Memory (SRAM), require lower power and are more durable than magnetic disk drives, but are also more expensive and are volatile, requiring constant power to maintain their memory. As a result, DRAM and SRAM devices are typically utilized in computer systems as temporary storage in addition to magnetic disk drives.

[0007] Another type of solid-state storage device is a Flash EEPROM device (hereinafter referred to as flash memory). Flash memory exhibits the advantages of DRAM and SRAM, while also providing the benefit of being non-volatile, which is to say that a flash memory device retains the data stored in its memory even in the absence of a power source. For this reason, for many applications, it is desirable to replace conventional magnetic disk drives in computer systems with flash memory devices.

[0008] One characteristic of some forms of non-volatile solid-state memory is that storage locations that already hold data are typically erased before they are re-written. Thus, a write operation to such a memory location is in fact an erase/write operation, also known as an erase/write cycle. This characteristic stands in contrast to magnetic storage media in which the act of re-writing to a location automatically writes over whatever data was originally stored in the location, with no need for an explicit erase operation.

[0009] Another characteristic of some forms of non-volatile solid-state memory is that repeated erase/write operations can cause the physical medium of the memory to deteriorate, as, for example, due to Time-Dependent-Dielectric-Breakdown (TDDB). Because of this characteristic deterioration, non-volatile solid-state storage systems can typically execute a finite number of erase/write operations in a given storage location before developing a defect in the storage location. One method for managing operation of a data storage system in the face of these defects is the practice of setting aside a quantity of alternate storage locations to replace storage locations that become defective. Such alternate storage locations are known as spare storage locations or “spares” locations. Thus, when a storage location defect is detected during a write operation, the data that was intended for storage in the now-defective location can be written instead to a “spares” location, and future operations intended for the now-defective location can be re-directed to the new spares location. With this method of defect recovery, as long as a sufficient number of spares locations have been set aside to accommodate the defects that occur, the system may continue to operate without interruption in spite of the occurrence of defects.

[0010] When a defect occurs and no free spares locations remain to serve as alternate data storage locations, the storage system can fail. Endurance is a term used to denote the cumulative number of erase/write cycles before a device fails. Reprogrammable non-volatile memories, such as flash memory, have a failure rate associated with endurance that is best represented by a classical “bathtub curve.” In other words, if the failure rate is drawn as a curve that changes over the lifetime of a memory device, the curve will resemble a bathtub shape. The bathtub curve can be broken down into three segments: a short, initially high, but steeply decreasing segment, sometimes called the “infant mortality phase” during which failures caused by manufacturing defects appear early in the life of a device and quickly decrease in frequency; a long, flat, low segment that represents the normal operating life of a memory device with few failures; and a short, steeply increasing segment, sometimes called the “wear-out phase,” when stress caused by cumulative erase/write cycles increasingly causes failures to occur. Thus, towards the end of a device's life span, deterioration can occur rapidly.

[0011] Often, when a storage system fails, the data contained in the storage system is partially or completely lost. In applications where a high value is placed on continued data integrity, storage systems prone to such data loss may not be acceptable, in spite of any other advantages that they may offer. For instance, a high degree of data integrity is desirable in a data storage system that is used in a router to hold copies of the router's configuration table, which can grow to massive size for a very large router. A high degree of data integrity is also desirable in data storage systems used to hold temporary copies of the data being transferred through a router. In this instance, ensuring a high level of data integrity is complicated by the fact that a very high number of erase/write operations are executed during the operation of such an application.

[0012] A challenge faced by reliability engineers is how to monitor a device's ability to cope with defects and to predict a device's failure so that data loss due to unanticipated system failures does not occur.

SUMMARY OF THE INVENTION

[0013] Spares locations in a digital data storage system are often set aside as alternate locations for data in the event that defects occur. As long as a sufficient number of spares locations remain available, a data storage system can handle the occurrence of new defects. When a system runs out of spares, however, the system can fail and data can be lost. In order to ensure the integrity of a data storage system, it is desirable to be able to predict and to avoid such failures.

[0014] An inventive method and system for early failure detection in a computer system is described herein that allows a digital data storage system to monitor the number of available spares remaining in some or all of its associated memory and to take appropriate preemptive action to avoid the consequences of an unanticipated failure. The early failure detection method and system can be implemented in a wide variety of embodiments depending on the configuration, needs, and capabilities of the computer system.

[0015] In a data storage system or device that can run out of spare storage locations for replacing defective storage locations, various embodiments are disclosed of an early failure detection system. In one embodiment, a count is kept of the available spare storage locations in a system, or sub-system, and when the amount of available spare locations drops to a threshold value, an action can be taken to avoid the consequences of an impending system failure. In other embodiments, the available spare storage locations are monitored by various other methods, for example, by keeping track of the percentage of initially available spare locations still remaining, by keeping track of the rate of new spare locations being used, or by other techniques. Various procedures, data structures, and hardware for implementing the early failure detection system may reside and may be executed in various locations, or parts, of the data storage system. Various actions may be undertaken by the early failure detection system upon detecting a possible impending failure, depending on the needs and capabilities of the system. Such actions may include, but are not limited to, sending out an alert, copying data from jeopardized parts of the system to non-jeopardized parts of the system, expanding the storage capacity of the system, and shutting down the system.

[0016] One embodiment of an early failure detection system for a flash memory system is described in which the flash memory system designates a quantity of storage locations as spares locations that are assigned for use as alternate storage locations in the event that defects occur. The early failure detection system comprises evaluating the quantity of spares locations available for assignment as alternate storage locations to determine if a threshold value has been reached and taking a preemptive action to avert impending failure of the flash memory system in the event that the quantity of spares locations reaches the threshold limit.

[0017] In one embodiment, the early failure detection system is a method comprising assigning a quantity of storage locations within a storage device to serve as spare storage locations and predicting the usability of the storage device based on the quantity of unused spare storage locations.

[0018] In one embodiment, the early failure detection system is a method of determining the usability of a solid-state storage device which comprises assigning a quantity of storage locations within a solid-state storage device to serve as spare storage locations in the event defects occur in the storage locations and predicting the usability of the solid-state storage device based on the quantity of unused spare storage locations.

[0019] In one embodiment, the early failure detection system is a method of monitoring the life expectancy of a flash memory device that comprises: assigning a quantity of storage locations within a flash memory device to serve as spare storage locations which are used when defects occur in the flash memory device, comparing the number of available spare locations with a predetermined threshold, and performing an action when the quantity of unused spare storage locations falls below the predetermined threshold, so as to avoid the consequences of a potential failure of the flash memory.

[0020] In one embodiment, the early failure detection system is implemented as a solid-state storage device comprising a plurality of storage locations, a plurality of spare storage locations that are used when defects occur in the storage locations, and processor circuitry configured to predict the usability of the solid-state storage device based on the quantity of unused spare storage locations.

[0021] In one embodiment, the early failure detection system is implemented as a flash memory device comprising a plurality of storage locations, a plurality of spare storage locations, a predetermined threshold value, and processor circuitry configured to compare the number of available spare storage locations with the predetermined threshold, and to perform an action when the quantity of unused spare storage locations falls below the predetermined threshold, so as to avoid the consequences of a potential failure of the flash memory.

[0022] In one embodiment, the early failure detection system is a method of determining the usability of a solid-state storage device, comprising assigning a quantity of storage locations within a solid-state storage device to serve as spare storage locations that are used when defects occur in the storage locations, monitoring the number of available spare storage locations, and performing an action when the quantity of unused spare storage locations falls below a desired amount, so as to avoid the consequences of a potential failure of the solid-state storage device.

[0023] One embodiment of an early failure detection system for a digital data storage system is described that designates a quantity of storage locations as spares locations that are assigned for use as alternate storage locations in the event that defects occur, that evaluates the quantity of spares locations available for assignment as alternate storage locations to determine if a threshold value has been reached, and that takes a preemptive action to avert impending failure of the digital data storage system in the event that the quantity of spares locations reaches the threshold limit.

[0024] For purposes of summarizing the invention, certain aspects, advantages and novel features of the invention have been described herein. It is to be understood that not necessarily all such advantages may be achieved in accordance with any particular embodiment of the invention. Thus, the invention may be embodied or carried out in a manner that achieves or optimizes one advantage or group of advantages as taught herein without necessarily achieving other advantages as may be taught or suggested herein.

[0025] Furthermore, although the early failure detection system is described herein with respect to embodiments that implement solid-state non-volatile memory, use of the system with respect to embodiments that implement non-solid-state memory is also contemplated.

BRIEF DESCRIPTION OF THE DRAWINGS

[0026] FIG. 1A is a high-level block diagram illustrating a general computer system with solid-state storage that implements an embodiment of the early failure detection system.

[0027] FIG. 1B is a more detailed block diagram illustrating a solid-state storage system that implements an embodiment of the early failure detection system.

[0028] FIG. 2 is a block diagram illustrating a plurality of memory area divisions occurring on solid-state memory chips in accordance with one embodiment of the early failure detection system.

[0029] FIG. 3 illustrates one embodiment of a structure for a spares count response sector utilized in accordance with one embodiment of the early failure detection system.

[0030] FIG. 4 illustrates a flowchart depicting one embodiment of a method for early failure detection in a computer system.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0031] A system and method for detecting an impending failure of a non-volatile storage device is disclosed herein. In order to fully specify the preferred design, various embodiment-specific details are set forth. For example, the early failure detection system is described within the example embodiment of a flash memory digital data storage system. It should be understood, however, that these details are provided to illustrate the preferred embodiments, and are not intended to limit the scope of the invention. The early failure detection system is not limited to embodiments using flash memory, and other embodiments, including those that employ other types of storage devices, such as other solid-state memory systems and non-solid-state memory systems, are also contemplated.

[0032] FIG. 1A illustrates one embodiment of a general configuration for a computer system 100 that can implement embodiments of the early failure detection system disclosed herein. The computer system 100 comprises a host system 102 and a plurality of storage devices, which in FIG. 1A are depicted as solid-state storage systems 110. The host system 102 can be any of a variety of processor-based devices that store data in a digital data storage system such as the solid-state storage system 110 shown in FIG. 1A. For example, the host system 102 could be a router that serves as a large network backbone, a Small Computer System Interface (SCSI) controller, a relatively small digital camera system, or any of a very large number of alternatives.

[0033] The host system 102 communicates with the solid-state storage systems 110 by way of a system interface 104. The solid-state storage systems 110 store data for the host system 102. A solid-state storage system 110 comprises a memory system controller 106, an array of one or more memory cards 108, and a communication interface 114, by means of which the memory system controller 106 communicates with the memory card array 108.

[0034] In various embodiments, the controller 106 can comprise controller circuitry, processor circuitry, processors, general purpose single-chip or multi-chip microprocessors, digital signal processors, embedded microprocessors, micro-controllers, and the like. In the embodiment illustrated in FIG. 1A, the memory card array 108 can be an array of flash memory cards. However, other types of memory media, including magnetic memory and other types of solid-state memory media, may be used without departing from the spirit of the early failure detection system. Similarly, the memory can be implemented on an individual card, chip, device, or other component, or on a plurality or variety of such cards, chips, devices, or other components.

[0035] On receipt of a command from the host system 102, the memory system controller 106 manages execution of the command. When the host 102 issues a write command to the solid-state storage system 110, the controller 106 transfers data from the system interface 104 to a storage location in the array of memory cards 108. When the command is a read command, the controller 106 orchestrates a transfer of data from one or more locations in the memory card array 108 that correspond to a host-provided address received via the system interface 104. The controller 106 transfers the data from the memory array 108 to the host system 102, again by way of the system interface 104.

[0036] An early failure detection system, as described herein, can be implemented in a computer system 100 to monitor memory locations and to take preemptive action if an impending memory failure is anticipated. As will be described in greater detail below, the early failure detection system can be implemented in a variety of embodiments. In accordance with some embodiments, early detection data 103, as well as associated structures, procedures, or code, may all be stored within the host system 102. In accordance with some embodiments, early detection data 107, again possibly accompanied by associated structures, procedures, or code, may be stored with the memory system controller 106 of the solid-state storage system 110. In other embodiments, early detection data 107, again possibly accompanied by associated structures, procedures, or code, may be stored, to various extents, in one or both locations.

[0037] FIG. 1B depicts a more detailed view of one embodiment of a solid-state storage system 110. As in FIG. 1A, FIG. 1B shows the solid-state storage system 110 comprising a memory system controller 106 that communicates with an array of one or more memory cards 108 via an interface 114. The memory system controller 106 may store early detection data 107 for the use of the early failure detection system. FIG. 1B further shows that a memory card 108 comprises a memory card controller 112 that communicates with an array 120 of one or more memory chips via a memory card interface 116. In accordance with some embodiments of the early failure detection system, early detection data 113 may be stored within the memory card controller 112.

[0038] FIG. 2 illustrates a more detailed view of one embodiment of the memory array 120 comprising four memory chips 222. As illustrated in FIG. 2, each memory chip 222 of the memory array 120 comprises a memory storage space 202, which is divided into a plurality of memory areas 204, 206, 208, 210. In the embodiment illustrated, the storage space 202 comprises a code storage area 204, a defect map area 206, a user data area 208, and a spares area 210.

[0039] Each of these memory areas 204, 206, 208, 210 is further subdivided into a plurality of individually erasable and addressable storage locations 214, 216, 218, 220, also called rows. In one embodiment, a row 214, 216, 218, 220 typically holds a plurality of sectors for storing data and a sector for holding control data usable by the memory card controller 112 in managing the memory card 108.

[0040] The code storage area 204 is a memory storage area for machine firmware code that provides instructions to the memory card controller 112. The user data area 208 is a memory storage area for data supplied by, and for the use of, the host system 102. As illustrated, the user data area 208 comprises most of the memory space 202 within the memory chip 222. In one embodiment, data read and write commands sent by the host system 102 to the memory card controller 112 have an associated host-provided logical address that identifies the desired data. The memory card controller 112 attempts to identify an associated location 218 in the user data area 208 that corresponds to the host-provided logical address and that holds, or will hold, the desired data, so that the host command can be executed.

[0041] When a defect develops in a user data area location 218, in some embodiments the location 218 is no longer useful for data storage purposes, and the memory card controller 112 attempts to identify an alternate, non-defective storage location for the data associated with the host-provided logical address.

[0042] In one embodiment, the spares area 210 comprises alternate storage locations that have been set aside for data that was previously located in user data area locations 218 that have developed defects. In the event that a defect in a user data area location 218 is detected during an erase/write operation, an unused alternate location 220 in the spares area 210 can be used for writing the data and can be assigned to the host-provided logical address for future data access needs.

[0043] The defect map area 206 is a memory storage location for a defect map, which, in one embodiment, is a list of relocation information for data items that have been relocated from the user data area 208 to the spares area 210 due to the development of defects in their original storage locations. In one embodiment, for each moved data item, the defect map 206 comprises a logical identifier for the data item, as well as a reference to a new location in the spares area 210 to which the data item has been moved. Thus, the defect map 206 can be used to locate data that have been moved to the spares area 210.
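By way of illustration only, the relocation and lookup behavior described for the spares area 210 and the defect map 206 can be sketched in Python as follows. The class name ChipModel, its fields, and the identity mapping from a logical address to a user data row are assumptions made for this sketch and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ChipModel:
    """Toy model of one chip's spares area and defect map (hypothetical names)."""
    # Unused rows in the spares area; the row numbers are illustrative only.
    free_spares: List[int]
    # Defect map: logical identifier -> spares-area row now holding the data.
    defect_map: Dict[int, int] = field(default_factory=dict)

    def resolve(self, logical_addr: int) -> Tuple[str, int]:
        """Return the area and row that currently serve a logical address."""
        if logical_addr in self.defect_map:
            return ("spares", self.defect_map[logical_addr])
        # Assumption: an unrelocated logical address maps to the same-numbered
        # row of the user data area.
        return ("user", logical_addr)

    def relocate(self, logical_addr: int) -> int:
        """Assign an unused spares row after a defect is detected on a write."""
        if not self.free_spares:
            raise RuntimeError("no spares locations remain")
        row = self.free_spares.pop(0)
        self.defect_map[logical_addr] = row
        return row

    def spares_remaining(self) -> int:
        """Number of spares rows still free to be assigned."""
        return len(self.free_spares)
```

In this sketch the value returned by spares_remaining() is the quantity that an early failure detection process would monitor; each call to relocate() reduces it by one.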

[0044] Although FIG. 2 shows the memory chip 222 subdivided into distinct areas and having a distinct organization, the types, locations, and organization of memory areas in the memory space 202 of the memory chip 222 may be substantially altered without detracting from the spirit of the early failure detection system.

[0045] Similarly, although FIG. 2 shows the memory array 120 comprising four substantially similar memory chips 222, the number and types of memory chips may be substantially altered without detracting from the spirit of the early failure detection system.

[0046] FIG. 3 shows one embodiment of a spares count response sector 300 that can be sent from the controller 106 of a solid-state storage system 110 to a host system 102 to report on the spares area locations 220 still free to be assigned on the memory cards 108 of the solid-state storage system 110. In the example embodiment shown in FIG. 3, the spares count response sector 300 is a binary data sector in which ten bytes are used to report on the spares areas 210 in a solid-state storage system 110 that has eight memory cards 108. In FIG. 3, Bytes “1”-“8” 320 correspond to the eight memory cards 108 of the solid-state storage system 110 and are used to store the number of available spares locations 220 for their respective memory cards 108. The eight bits 315 of Byte “0” 310 correspond to the eight Bytes “1”-“8” 320 and are used to indicate whether or not the spares count in the corresponding byte 320 is valid. For example, in one embodiment, if a bit “0” 315 of Byte “0” 310 is set to equal “1,” then the corresponding count for Card 1, as stored in Byte “1” 320, is deemed to be valid. In the embodiment depicted in FIG. 3, Byte “9” 330 stores a cumulative total of unused spares locations 220 available for the solid-state storage system 110.
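As a hedged illustration of the ten-byte layout just described, a host-side routine might decode the spares count response sector 300 as sketched below. The names SparesReport and parse_spares_sector are hypothetical, and the sketch assumes that bit “0” is the least-significant bit of Byte “0” and that bit n validates Byte n+1; the text fixes only the correspondence between bit “0” and Byte “1”.

```python
from typing import List, NamedTuple, Optional


class SparesReport(NamedTuple):
    """Decoded spares count response sector (hypothetical structure)."""
    per_card: List[Optional[int]]  # None where the validity bit is not set
    total: int                     # cumulative total taken from Byte 9


def parse_spares_sector(sector: bytes) -> SparesReport:
    """Decode the first ten bytes of a spares count response sector."""
    if len(sector) < 10:
        raise ValueError("sector must contain at least ten bytes")
    valid_bits = sector[0]  # Byte 0: one validity bit per memory card
    per_card: List[Optional[int]] = []
    for card in range(8):
        # Assumption: bit 0 is the least-significant bit and validates Byte 1.
        is_valid = (valid_bits >> card) & 1
        per_card.append(sector[1 + card] if is_valid else None)
    return SparesReport(per_card=per_card, total=sector[9])
```

Because each count occupies a single byte in this layout, a per-card count or cumulative total above 255 could not be represented; the disclosure does not say how such a case would be handled, and the sketch does not attempt to.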

[0047] FIG. 4 presents a flowchart depicting one embodiment of a process 400 for the early detection of impending failure due to lack of spares locations 220 in a computer system 100. In FIG. 4, the process 400 is described in a generic form that may be implemented in a variety of embodiments, a sampling of which will be described below. In one embodiment, the process 400 monitors the amount of free spares locations 220 available to the system 100 and notes when the amount of available spares locations 220 reaches or drops below a threshold amount. In the event that the amount of available spares locations 220 reaches or drops below the threshold amount, the process 400 may trigger one or more of a variety of responses, some examples of which are described in greater detail below.

[0048] As described above with reference to FIGS. 1A and 1B, the computer system 100 may be configured in a wide variety of configurations depending on the functions, the storage capacities, and other requirements and parameters of the system 100. In particular, the memory capacity of the system 100 may be configured in a variety of configurations. In one embodiment, a host system 102 may be associated with a plurality of storage systems. For example, the host system 102 as depicted in FIG. 1A is associated with a plurality of solid-state storage systems 110, at least one of which comprises a plurality of memory cards 108, at least one of the memory cards 108 comprising a plurality of memory chips 222. In another embodiment, the host system 102 is directly associated with a plurality of memory cards 108. In yet another embodiment, the host system 102 is associated with a single memory card 108 that comprises eight memory chips 222. In some embodiments, a spares area 210 is set aside on each chip 222 for the relocation of data from locations in the user data area 208 that have developed defects. In some embodiments, a chip 222 that runs out of its own available spares locations 220 fails; in other embodiments, a chip 222 that runs out of spares locations 220 may use available spares locations 220 in another part of the computer system 100, and this extends its life span.

[0049] In accordance with this variety of possible configurations of the computer system 100, the process 400 described in FIG. 4 may be executed in a variety of locations in the computer system 100 and may serve to monitor all of the spares locations 220 available to the system 100, or a portion of the spares locations 220 available to the system 100, or a combination of the two. For example, in one embodiment, the process 400 is implemented within the host system 102, which receives information about the available spares locations 220 in the individual memory cards 108 of its various solid-state storage systems 110 via the system interface 104. In one embodiment, the process 400 is implemented within the host system 102, which receives information about a total aggregated amount of available spares locations 220 on each solid-state storage system 110. In one embodiment, the process 400 is implemented separately within the memory system controller 106 of each solid-state storage system 110, where the process 400 monitors the available spares locations 220 in the storage system's 110 array of one or more memory cards 108 via an interface 114 with the memory cards 108. Such an embodiment of the process 400 may communicate any necessary and related information to the host system 102 via the system interface 104. In one embodiment, the process 400 is implemented within the controller 112 of a memory card 108 to monitor the available spares locations 220 on the memory card's 108 memory chip array 120. In one embodiment, the process 400 may be implemented in an auxiliary location of the computer system 100, or in more than one of the locations described herein, or in other locations, or in a combination of these and other locations.

[0050] As shown in FIG. 4, the process 400 begins at start state 410 and continues to state 420, where an updated spares count is received. The spares count is information about the amount of spares locations 220 still available for use as alternate storage locations, and the spares count can be implemented in a number of different embodiments. For example, in one embodiment, the spares count is the number of spares locations 220 still available on a given memory chip 222. In one embodiment, the spares count is the number of spares locations 220 still available on a plurality of memory chips 222. The spares count response sector 300 illustrated in FIG. 3 is one embodiment of a structure that can be used to report on the number of spares locations 220 still available on each of an array of eight memory cards 108 as well as on the total number of spares locations 220 still available on the array of memory cards 108. In one embodiment, the spares count is, conversely, the number of spares locations 220 that have been used and that are no longer available for use as alternate storage locations. In one embodiment, the spares count is a percentage value, or set of values, that indicates the percentage of remaining spares locations 220 on one or more memory chips 222. In one embodiment, the spares count may rely upon the knowledge that some types of non-volatile solid-state memory exhibit a steeply increasing defect rate near the end of their usable life-span, and the spares count may accordingly indicate a rate of defect occurrence or a measure of acceleration in a rate of defect occurrence. These and other embodiments of a spares count update are contemplated and fall within the scope of the early failure detection system.
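The alternative forms of spares count named above (a percentage remaining, a rate of defect occurrence, and an acceleration in that rate) can be derived from a series of raw counts. The following Python sketch is illustrative only; the function names are hypothetical, and it assumes that the counts are sampled at regular update intervals so that differences between consecutive samples can stand in for a rate.

```python
from typing import List


def percent_remaining(available: int, initial: int) -> float:
    """Percentage of the initially assigned spares that are still unused."""
    if initial <= 0:
        raise ValueError("initial spares count must be positive")
    return 100.0 * available / initial


def usage_rates(available_history: List[int]) -> List[int]:
    """Spares consumed between each pair of consecutive updates."""
    return [prev - cur
            for prev, cur in zip(available_history, available_history[1:])]


def usage_acceleration(available_history: List[int]) -> List[int]:
    """Change in the consumption rate from one update interval to the next."""
    rates = usage_rates(available_history)
    return [later - earlier for earlier, later in zip(rates, rates[1:])]
```

A run of positive values from usage_acceleration() would be consistent with the steeply increasing defect rate that some memories exhibit near the end of their usable life-span, although the size of increase that should be treated as significant is left to the particular embodiment.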

[0051] In one embodiment, the receipt of an updated spares count may come in response to a request that is triggered by a timer set to initiate an update request after a fixed period of time has elapsed. In another embodiment, the receipt of an updated spares count may come in response to a request that is triggered by a timer set to initiate an update request after a varying period of time has elapsed. In one embodiment, the receipt of an updated spares count may come in response to a request that is triggered by a timer set to initiate an update request after a fixed or a varying period of device operation time has elapsed since a last update. In one embodiment, the receipt of an updated spares count may come in response to a request that is triggered by a counter set to send out an update request after a given number of one or more erase/write operations, or overall system operations, or other activity. In one embodiment, the receipt of an updated spares count may come in response to a request that is triggered by an increased rate of defect occurrence. In one embodiment, updated spares count information may be gathered and reported as a background activity that executes whenever a processor is free to do so.
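One of the triggers listed above, a counter that sends out an update request after a given number of erase/write operations, might be sketched as follows. The class name UpdateTrigger and its method are hypothetical; a timer-based trigger would have the same shape with elapsed time in place of the operation count.

```python
class UpdateTrigger:
    """Signals that a spares count update is due every N operations."""

    def __init__(self, ops_per_update: int) -> None:
        if ops_per_update < 1:
            raise ValueError("ops_per_update must be at least 1")
        self.ops_per_update = ops_per_update
        self._ops_since_update = 0

    def record_operation(self) -> bool:
        """Count one erase/write operation; return True when an update is due."""
        self._ops_since_update += 1
        if self._ops_since_update >= self.ops_per_update:
            self._ops_since_update = 0  # restart the count for the next interval
            return True
        return False
```

With ops_per_update set to 1, every erase/write operation requests an update, which corresponds to the lower bound of "one or more" operations mentioned above.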

[0052] As described above, the process 400 may be implemented in a variety of locations within a computer system 100. Similarly, the process 400 may cause the updated spares count to be received in any of these or other locations.

[0053] After receiving an updated spares count in state 420, the process 400 moves on to state 430, where the updated spares count information is evaluated to see if the amount of available spares locations has reached a threshold value that signals an impending failure of part or all of the computer system 100. With respect to state 430, a variety of embodiments exist. In one embodiment, for example, the threshold value is pre-determined; in another embodiment, the threshold value is determined dynamically. In one embodiment, for example, a threshold value is determined and is applied uniformly to all similarly sized memory units. In another embodiment, a threshold value is determined individually for each memory unit based on a count of the unit's initial number of spares locations 220. The evaluation process of state 430 may take place in the host system 102, in a solid-state storage system 110, in a memory card 108, or in some other location or combination of locations. Similarly, the evaluation may be embodied in a number of different forms. A threshold value or percentage may be stored for comparison with the received spares count update. For example, a value that represents 2%, or 5%, or 20%, or some other portion of the original amount of locations set aside to be spares locations 210 may be designated as a lower acceptable bound, or threshold, on the amount of remaining spares locations before failure-preventive measures are undertaken by the system 100. Alternately, an updated spares count can be compared with an original number of available spares locations 220, which may be stored in an early detection data location 103, 107, 113 in the host system 102, in a solid-state storage system 110, in a memory card 108, or in some other location or combination of locations.
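
By way of illustration only, a threshold determined individually for each memory unit as a percentage of that unit's initial number of spares locations may be computed and compared as sketched below in Python. The function names are hypothetical, and rounding the threshold up to a whole number of locations is an assumption of this sketch rather than a requirement of any embodiment.

```python
import math


def threshold_from_percent(initial_spares: int, percent: float) -> int:
    """Smallest whole number of spares that is at least `percent` of the initial amount.

    Rounding up is an assumption of this sketch.
    """
    return math.ceil(initial_spares * percent / 100.0)


def threshold_reached(remaining_spares: int, threshold: int) -> bool:
    """True when the remaining spares have dropped to, or below, the threshold."""
    return remaining_spares <= threshold
```

For example, under these assumptions a unit that began with 200 spares locations and uses a 5% threshold would have a threshold of 10 locations, and the threshold would be considered reached once 10 or fewer spares locations remain.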

[0054] Once the updated spares count or counts have been evaluated in state 430, the process 400 moves on to state 440, where the process 400 determines if a threshold value has been reached.

[0055] If no threshold value has been reached, the process 400 moves on to state 450 where the process continues waiting for a next spares count update to be triggered. As described above with respect to state 420, many embodiments exist for triggering a spares count update request. Accordingly, in state 450, the process 400 may prepare to wait for the next trigger by resetting any necessary timers or counters or registers, by updating stored values, by making notations in a log that may be stored or viewed by system administrators, by communicating with other parts of the computer system 100, or by performing other actions. Alternately, no action may be required at this point of the process 400. Once any such preparations for continued waiting have been executed, the process 400 moves on to state 470, where the process 400 is complete and waiting for the next spares count update can commence.

[0056] Returning to state 440, if the process 400 determines that one or more threshold values have been reached, the process 400 moves on to state 460 where preemptive action can be taken to avert failure of all or part of the system 100. With respect to state 460, a variety of embodiments of preemptive actions exist. For example, in one embodiment, when the number of available spares locations 220 drops to a threshold value, the system can send an alert message to a user or to a control system to have the computer system 100, or a part of the system 100, turned off until the situation can be rectified. In one embodiment, all or part of the data stored on a device in danger of impending failure can be copied to another storage device automatically, and operation of the system 100 can continue with little or no interruption. In one embodiment, back-up storage locations or devices can be activated and used to reduce the load on devices in danger of impending failure. In one embodiment, software is activated to allow for the increased sharing of spares areas 210 across chips 222 or cards 108 or other memory devices. In one embodiment, information is updated and stored. In another embodiment, information is communicated to other parts of the system 100. In one embodiment, no preemptive action is taken. These and other embodiments of a preemptive response to an evaluated impending failure are contemplated and fall within the scope of the early failure detection system.
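
By way of illustration only, one pass through the states of the process 400 may be sketched in Python as follows. The function name, its parameters, and the returned list of visited states are hypothetical. The sketch further assumes, which the description above does not state, that the process proceeds to state 470 after the preemptive action of state 460.

```python
from typing import Callable, List


def run_process_400(
    receive_count: Callable[[], int],
    threshold: int,
    preemptive_action: Callable[[int], None],
    prepare_wait: Callable[[], None] = lambda: None,
) -> List[int]:
    """Run one pass of the process and return the state numbers visited."""
    visited = [410]                      # start state 410
    remaining = receive_count()          # state 420: receive updated spares count
    visited.append(420)
    reached = remaining <= threshold     # state 430: evaluate against the threshold
    visited.append(430)
    visited.append(440)                  # state 440: has a threshold been reached?
    if reached:
        preemptive_action(remaining)     # state 460: e.g., alert, copy data, shut down
        visited.append(460)
    else:
        prepare_wait()                   # state 450: reset timers, log, or do nothing
        visited.append(450)
    visited.append(470)                  # state 470: end (assumed for both branches)
    return visited
```

The preemptive action is passed in as a callable so that any of the responses described above, including taking no action, can be substituted without changing the flow.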

[0057] While certain embodiments of the invention have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. The early failure detection system may be embodied in other specific forms without departing from the essential characteristics described herein. Accordingly, the breadth and scope of the invention should be defined in accordance with the following claims and their equivalents.

What is claimed is:
 1. An early failure detection method for a flash memory system wherein the flash memory method designates a quantity of storage locations as spares locations, the spares locations being assigned for use as alternate storage locations in the event that defects occur, the early failure detection system comprising: evaluating the quantity of spares locations available for assignment as alternate storage locations to determine if a threshold value has been reached; and in the event that the quantity of spares locations reaches the threshold limit, taking a preemptive action to avert impending failure of the flash memory system.
 2. A method of determining the usability of a solid-state storage device, wherein the solid-state storage device comprises spare storage locations for use in the event a defect occurs in other storage locations, the method comprising predicting the usability of the solid-state storage device based on the quantity of unused spare storage locations.
 3. The method of claim 2, further comprising assigning a quantity of storage locations within a solid-state storage device to serve as spare storage locations in the event defects occur in the storage locations.
 4. The method of claim 2, wherein the act of predicting the usability of the solid-state storage device comprises determining whether the quantity of unused spare storage locations is less than a predetermined threshold amount.
 5. The method of claim 2, wherein the act of predicting comprises comparing the amount of unused spare storage locations to an original amount of spare storage locations.
 6. The method of claim 2, wherein the act of predicting comprises monitoring the frequency of defects occurring.
 7. The method of claim 2, wherein the act of predicting comprises monitoring the rate of change in the frequency of defects occurring.
 8. The method of claim 2, wherein the act of predicting calculates a currently available amount of spare storage locations as a percentage of an initially available amount of spare storage locations.
 9. A method of monitoring the life expectancy of a flash memory device, wherein the solid-state storage device comprises spare storage locations for use in the event a defect occurs in other storage locations, the method comprising: comparing the number of available spare locations with a predetermined threshold; and performing an action when the quantity of unused spare storage locations falls below the predetermined threshold, so as to avoid the consequences of a potential failure of the flash memory.
 10. The method of claim 9, further comprising assigning a quantity of storage locations within a flash memory device to serve as spare storage locations wherein the spare storage locations are used when defects occur in the flash memory device.
 11. The method of claim 9, wherein the predetermined threshold is stored in a controller in the flash memory device.
 12. The method of claim 9, wherein the predetermined threshold is stored in a memory array associated with the flash memory device.
 13. The method of claim 9, wherein the predetermined threshold is stored in a host system that stores data in the flash memory device.
 14. The method of claim 9, wherein the predetermined threshold is calculated as a percentage of an initial number of spare storage locations available within the flash memory device.
 15. The method of claim 9, wherein the predetermined threshold is calculated as a percentage of an average number of spare storage locations typically available within a flash memory device similar in memory capacity to the flash memory device.
 16. A solid-state storage device comprising: a plurality of storage locations; a plurality of spare storage locations wherein the spare storage locations are used when defects occur in the storage locations; and processor circuitry configured to predict the usability of the solid-state storage device based on the quantity of unused spare storage locations.
 17. The solid-state storage device of claim 16, wherein the processor circuitry is further configured to send a notification regarding the usability of the solid-state storage device.
 18. The solid-state storage device of claim 16, wherein the processor circuitry is further configured to display the quantity of unused spare storage locations.
 19. The solid-state storage device of claim 16, wherein the processor circuitry is further configured to copy data from some storage locations to other storage locations.
 20. The solid-state storage device of claim 16, wherein the processor circuitry is further configured to automatically enable the addition of supplemental storage locations for use by the solid-state storage device.
 21. The solid-state storage device of claim 16, wherein the processor circuitry is further configured to enable a manual addition of supplemental storage locations for use by the solid-state storage device.
 22. A flash memory device comprising: a plurality of storage locations; a plurality of spare storage locations; a predetermined threshold value; and processor circuitry configured to compare the number of available spare storage locations with the predetermined threshold, and wherein the processor circuitry is further configured to perform an action when the quantity of unused spare storage locations falls below the predetermined threshold, so as to avoid the consequences of a potential failure of the flash memory.
 23. The flash memory device of claim 22, wherein the flash memory device is a flash memory card.
 24. The flash memory device of claim 22, wherein the flash memory device is a flash memory chip.
 25. The flash memory device of claim 22, wherein the flash memory device is an array of flash memory cards.
 26. The flash memory device of claim 22, wherein storage locations can be dynamically allocated as spare storage locations.
 27. The flash memory device of claim 22, wherein the action performed by the processor circuitry allows for the use of other unused spare storage locations accessible by the flash memory device to serve as supplemental spare storage locations.
 28. A method of determining the usability of a solid-state storage device, the method comprising: assigning a quantity of storage locations within a solid-state storage device to serve as spare storage locations wherein such spare storage locations are used when defects occur in the storage locations; monitoring the number of available spare storage locations; and performing an action when the quantity of unused spare storage locations falls below a desired amount, so as to avoid the consequences of a potential failure of the solid-state storage device.
 29. The method of claim 28, wherein monitoring the number of available spare storage locations takes place within the memory device.
 30. The method of claim 28, wherein monitoring the number of available spare storage locations takes place within a host system that uses the memory device to store data.
 31. The method of claim 28, wherein monitoring the number of available spare storage locations takes place within the controller of the memory device.
 32. The method of claim 28, wherein monitoring the number of available spare storage locations takes place within a peripheral controller.
 33. The method of claim 28, wherein monitoring the number of available spare storage locations takes place within a bus controller.
 34. The method of claim 28, wherein monitoring the number of available spare storage locations takes place within any processor configured to monitor the memory device.
 35. An early failure detection system for a digital data storage system that designates a quantity of storage locations as spares locations, the spares locations being assigned for use as alternate storage locations in the event that defects occur, the early failure detection system comprising: evaluating the quantity of spares locations available for assignment as alternate storage locations to determine if a threshold value has been reached; and in the event that the quantity of spares locations reaches the threshold limit, taking a preemptive action to avert impending failure of the digital data storage system.
 36. The method of claim 35, wherein evaluating the quantity of spares locations available for assignment is carried out by referring to a counter that is incremented each time a new spares location is used.
 37. The method of claim 35, wherein evaluating the quantity of spares locations available for assignment is carried out by counting all available spares locations at predetermined time intervals.
 38. The method of claim 35, wherein evaluating the quantity of spares locations available for assignment is carried out upon request by a host system 102.
 39. A system for determining the usability of a solid-state storage device, wherein the solid-state storage device comprises spare storage locations for use in the event a defect occurs in other storage locations, the system comprising: means for monitoring the number of available spare storage locations; and means for performing an action when the quantity of unused spare storage locations falls below a desired amount, so as to avoid the consequences of a potential failure of the solid-state storage device.